Group Members: Yuwei Zhu, Junghsuan Chu, Junming Huang, Qiutong Dong
Our final project comes from a Kaggle competition:
Recruit Restaurant Visitor Forecasting - predict how many future visitors a restaurant will receive.
This is a relational dataset from two reservation systems. Each restaurant has a unique air_store_id and hpg_store_id. Note that not all restaurants are covered by both systems, and that we have been provided data beyond the restaurants for which we must forecast.
In this competition, we use panel data to forecast the number of visitors each restaurant receives. The dataset is provided by Recruit, which owns a restaurant reservation service (the hpg system, similar to Yelp) and a reservation control and cash register system (the air system).
We also include weather data in our analysis, which should bring our predictions closer to the real-world situation.
Running a thriving local restaurant isn’t always as charming as first impressions suggest. All sorts of unexpected troubles can pop up and hurt business, especially in today’s fast-changing market.
For instance: how many people are using the app to make reservations? Do these users actually show up after making those reservations? Should I worry that the weather will affect my reserved visitors?
One common predicament is that restaurants need to know how many customers to expect each day to effectively purchase ingredients and schedule staff members. This forecast isn’t easy to make because many unpredictable factors affect restaurant attendance, like weather and local competition. It’s even harder for newer restaurants with little historical data.
Therefore, we are interested in digging deeper into forecasting the number of visiting customers for restaurants on a daily basis.
library(tidyverse)
library(ggplot2)
library(tidymodels)
library(caret)
library(DataExplorer)
library(gridExtra)
library(scales)
library(doParallel)
library(tidyposterior)
library(ggthemes)
library(ggmap)
library(sp)
library(maptools)
library(maps)
library(RColorBrewer)
library(viridis)
library(ggridges)
library(leaflet)
library(htmltools)
library(stringr)
library(data.table)

air_reserve <- read_csv("data/air_reserve.csv")
air_store_info <- read_csv("data/air_store_info.csv")
air_visit_data <- read_csv("data/air_visit_data.csv")
date_info <- read_csv("data/date_info.csv")
hpg_reserve <- read_csv("data/hpg_reserve.csv")
hpg_store_info <- read_csv("data/hpg_store_info.csv")
store_id_relation <- read_csv("data/store_id_relation.csv")
test_sample <- read_csv("data/sample_submission.csv")
weather <- read_csv("data/weather.csv")
station <- read_csv("data/nearest_active_station.csv")

The Kaggle competition provides seven datasets containing different information about the restaurants, plus one testing set for submission. Beyond this basic information, the competition allows us to extract external weather data from the Japan Meteorological Agency (http://www.jma.go.jp/jma/indexe.html) to obtain potentially useful attributes.
Below is a brief introduction to each dataset.
This file contains reservations made in the air system. Note that the reserve_datetime indicates the time when the reservation was created, whereas the visit_datetime is the time in the future where the visit will occur.
air_store_id - the restaurant’s id in the air system
visit_datetime - the time of the reservation
reserve_datetime - the time the reservation was made
reserve_visitors - the number of visitors for that reservation
summary(air_reserve)

## air_store_id visit_datetime
## Length:92378 Min. :2016-01-01 19:00:00
## Class :character 1st Qu.:2016-11-15 19:00:00
## Mode :character Median :2017-01-05 18:00:00
## Mean :2016-12-05 08:18:58
## 3rd Qu.:2017-03-03 19:00:00
## Max. :2017-05-31 21:00:00
## reserve_datetime reserve_visitors
## Min. :2016-01-01 01:00:00 Min. : 1.000
## 1st Qu.:2016-11-07 17:00:00 1st Qu.: 2.000
## Median :2016-12-27 22:00:00 Median : 3.000
## Mean :2016-11-27 01:13:07 Mean : 4.482
## 3rd Qu.:2017-02-26 18:00:00 3rd Qu.: 5.000
## Max. :2017-04-22 23:00:00 Max. :100.000
glimpse(air_reserve)

## Observations: 92,378
## Variables: 4
## $ air_store_id <chr> "air_877f79706adbfb06", "air_db4b38ebe7a7ceff...
## $ visit_datetime <dttm> 2016-01-01 19:00:00, 2016-01-01 19:00:00, 20...
## $ reserve_datetime <dttm> 2016-01-01 16:00:00, 2016-01-01 19:00:00, 20...
## $ reserve_visitors <dbl> 1, 3, 6, 2, 5, 2, 4, 2, 2, 2, 3, 3, 2, 6, 7, ...
This file contains information about select air restaurants. Column names and contents are self-explanatory.
air_store_id
air_genre_name
air_area_name
latitude
longitude
Note: latitude and longitude refer to the area to which the store belongs, not the exact location, for privacy reasons.
summary(air_store_info)

## air_store_id air_genre_name air_area_name latitude
## Length:829 Length:829 Length:829 Min. :33.21
## Class :character Class :character Class :character 1st Qu.:34.70
## Mode :character Mode :character Mode :character Median :35.66
## Mean :35.65
## 3rd Qu.:35.69
## Max. :44.02
## longitude
## Min. :130.2
## 1st Qu.:135.3
## Median :139.7
## Mean :137.4
## 3rd Qu.:139.8
## Max. :144.3
glimpse(air_store_info)

## Observations: 829
## Variables: 5
## $ air_store_id <chr> "air_0f0cdeee6c9bf3d7", "air_7cc17a324ae5c7dc",...
## $ air_genre_name <chr> "Italian/French", "Italian/French", "Italian/Fr...
## $ air_area_name <chr> "Hyogo-ken Kobe-shi Kumoidori", "Hyogo-ken Kobe...
## $ latitude <dbl> 34.69512, 34.69512, 34.69512, 34.69512, 35.6580...
## $ longitude <dbl> 135.1979, 135.1979, 135.1979, 135.1979, 139.751...
This file contains historical visit data for the air restaurants.
air_store_id
visit_date - the date
visitors - the number of visitors to the restaurant on the date
summary(air_visit_data)

## air_store_id visit_date visitors
## Length:252108 Min. :2016-01-01 Min. : 1.00
## Class :character 1st Qu.:2016-07-23 1st Qu.: 9.00
## Mode :character Median :2016-10-23 Median : 17.00
## Mean :2016-10-12 Mean : 20.97
## 3rd Qu.:2017-01-24 3rd Qu.: 29.00
## Max. :2017-04-22 Max. :877.00
glimpse(air_visit_data)

## Observations: 252,108
## Variables: 3
## $ air_store_id <chr> "air_ba937bf13d40fb24", "air_ba937bf13d40fb24", "...
## $ visit_date <date> 2016-01-13, 2016-01-14, 2016-01-15, 2016-01-16, ...
## $ visitors <dbl> 25, 32, 29, 22, 6, 9, 31, 21, 18, 26, 21, 11, 24,...
This file gives basic information about the calendar dates in the dataset.
calendar_date
day_of_week
holiday_flg - is the day a holiday in Japan
summary(date_info)

glimpse(date_info)
This file contains reservations made in the hpg system.
hpg_store_id - the restaurant’s id in the hpg system
visit_datetime - the time of the reservation
reserve_datetime - the time the reservation was made
reserve_visitors - the number of visitors for that reservation
summary(hpg_reserve)

## hpg_store_id visit_datetime
## Length:2000320 Min. :2016-01-01 11:00:00
## Class :character 1st Qu.:2016-06-26 19:00:00
## Mode :character Median :2016-11-19 20:00:00
## Mean :2016-10-15 06:55:20
## 3rd Qu.:2017-02-03 19:00:00
## Max. :2017-05-31 23:00:00
## reserve_datetime reserve_visitors
## Min. :2016-01-01 00:00:00 Min. : 1.000
## 1st Qu.:2016-06-21 12:00:00 1st Qu.: 2.000
## Median :2016-11-10 20:00:00 Median : 3.000
## Mean :2016-10-07 19:57:59 Mean : 5.074
## 3rd Qu.:2017-01-27 13:00:00 3rd Qu.: 6.000
## Max. :2017-04-22 23:00:00 Max. :100.000
glimpse(hpg_reserve)

## Observations: 2,000,320
## Variables: 4
## $ hpg_store_id <chr> "hpg_c63f6f42e088e50f", "hpg_dac72789163a3f47...
## $ visit_datetime <dttm> 2016-01-01 11:00:00, 2016-01-01 13:00:00, 20...
## $ reserve_datetime <dttm> 2016-01-01 09:00:00, 2016-01-01 06:00:00, 20...
## $ reserve_visitors <dbl> 1, 3, 2, 5, 13, 2, 2, 2, 2, 6, 2, 2, 2, 2, 5,...
This file contains information about select hpg restaurants. Column names and contents are self-explanatory.
hpg_store_id
hpg_genre_name
hpg_area_name
latitude
longitude
Note: latitude and longitude refer to the area to which the store belongs, not the exact location.
summary(hpg_store_info)

glimpse(hpg_store_info)
This file allows you to join select restaurants that have both the air and hpg system.
hpg_store_id
air_store_id
summary(store_id_relation)

## air_store_id hpg_store_id
## Length:150 Length:150
## Class :character Class :character
## Mode :character Mode :character
glimpse(store_id_relation)

## Observations: 150
## Variables: 2
## $ air_store_id <chr> "air_63b13c56b7201bd9", "air_a24bf50c3e90d583", "...
## $ hpg_store_id <chr> "hpg_4bc649e72e2a239a", "hpg_c34b496d0305a809", "...
This file shows a submission in the correct format, including the days for which you must forecast.
id - the id is formed by concatenating the air_store_id and visit_date with an underscore
visitors - the number of visitors forecast for the store and date combination
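Since the id concatenates air_store_id and visit_date with an underscore, the two parts can be recovered from each other. A minimal base-R sketch, using one id from the sample submission; the date is always the last 10 characters:

```r
id <- "air_00a91d42b08b08d9_2017-04-23"

# the trailing 10 characters are the visit date (YYYY-MM-DD)
visit_date <- substr(id, nchar(id) - 9, nchar(id))
# everything before the separating underscore is the store id
air_store_id <- substr(id, 1, nchar(id) - 11)

visit_date    # "2017-04-23"
air_store_id  # "air_00a91d42b08b08d9"
```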
summary(test_sample)

## id visitors
## Length:32019 Min. :0
## Class :character 1st Qu.:0
## Mode :character Median :0
## Mean :0
## 3rd Qu.:0
## Max. :0
glimpse(test_sample)

## Observations: 32,019
## Variables: 2
## $ id <chr> "air_00a91d42b08b08d9_2017-04-23", "air_00a91d42b08b0...
## $ visitors <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
This file contains weather information for every weather station.
visit_date
avg_temperature
high_temperature
low_temperature
precipitation
hours_sunlight
solar_radiation
deepest_snowfall
total_snowfall
avg_wind_speed
avg_vapor_pressure
avg_local_pressure
avg_humidity
avg_sea_pressure
cloud_cover
station_id - the join of a station's prefecture, first_name, and second_name, with "__" (double underscores)
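Because station_id joins its three parts with double underscores, it splits apart cleanly with strsplit. A small sketch, using a station id that appears in the glimpse output of the weather table:

```r
station_id <- "okinawa__ohara-kana__oohara"

# split on the literal "__" separator
parts <- strsplit(station_id, "__", fixed = TRUE)[[1]]
prefecture  <- parts[1]  # "okinawa"
first_name  <- parts[2]  # "ohara-kana"
second_name <- parts[3]  # "oohara"
```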
summary(weather)

## X1 visit_date avg_temperature high_temperature
## Min. : 0 Min. :2016-01-01 Min. :-30.9 Min. :-29.1
## 1st Qu.:129 1st Qu.:2016-05-09 1st Qu.: 4.3 1st Qu.: 9.2
## Median :258 Median :2016-09-15 Median : 11.7 Median : 16.9
## Mean :258 Mean :2016-09-15 Mean : 11.7 Mean : 16.4
## 3rd Qu.:387 3rd Qu.:2017-01-22 3rd Qu.: 19.5 3rd Qu.: 24.3
## Max. :516 Max. :2017-05-31 Max. : 32.5 Max. : 39.7
## NA's :379537 NA's :379582
## low_temperature precipitation hours_sunlight solar_radiation
## Min. :-33.8 Min. : 0.00 Min. : 0.0 Mode:logical
## 1st Qu.: -0.3 1st Qu.: 0.00 1st Qu.: 0.9 NA's:859771
## Median : 6.8 Median : 0.00 Median : 4.6
## Mean : 7.4 Mean : 4.86 Mean : 4.9
## 3rd Qu.: 15.6 3rd Qu.: 3.50 3rd Qu.: 8.4
## Max. : 29.6 Max. :422.00 Max. :15.1
## NA's :379582 NA's :219492 NA's :426298
## deepest_snowfall total_snowfall avg_wind_speed avg_vapor_pressure
## Mode:logical Mode:logical Min. : 0.0 Mode:logical
## NA's:859771 NA's:859771 1st Qu.: 1.4 NA's:859771
## Median : 2.0
## Mean : 2.5
## 3rd Qu.: 3.1
## Max. :23.3
## NA's :386764
## avg_local_pressure avg_humidity avg_sea_pressure cloud_cover
## Mode:logical Mode:logical Mode:logical Mode:logical
## NA's:859771 NA's:859771 NA's:859771 NA's:859771
##
##
##
##
##
## station_id
## Length:859771
## Class :character
## Mode :character
##
##
##
##
glimpse(weather)

## Observations: 859,771
## Variables: 17
## $ X1 <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1...
## $ visit_date <date> 2016-01-01, 2016-01-02, 2016-01-03, 2016-0...
## $ avg_temperature <dbl> 20.5, 23.5, 21.7, 21.6, 22.1, 21.2, 20.5, 1...
## $ high_temperature <dbl> 22.4, 26.2, 23.7, 23.8, 24.6, 24.9, 22.3, 2...
## $ low_temperature <dbl> 17.5, 21.2, 20.2, 20.4, 20.5, 19.1, 19.5, 1...
## $ precipitation <dbl> 0.0, 5.0, 11.0, 11.0, 35.5, 11.5, 0.0, 41.5...
## $ hours_sunlight <dbl> 0.6, 3.6, 0.0, 0.1, 0.0, 0.8, 3.9, 0.0, 5.7...
## $ solar_radiation <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ deepest_snowfall <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ total_snowfall <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ avg_wind_speed <dbl> 6.3, 4.7, 2.8, 3.3, 2.4, 4.9, 6.8, 7.2, 5.7...
## $ avg_vapor_pressure <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ avg_local_pressure <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ avg_humidity <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ avg_sea_pressure <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ cloud_cover <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ station_id <chr> "okinawa__ohara-kana__oohara", "okinawa__oh...
This file describes the closest weather station to at least one store in air_store_info.
air_store_id
air_genre_name
air_area_name
latitude
longitude
latitude_str
longitude_str
station_id
station_latitude
station_longitude
station_vincenty
station_great_circle
Get ready, let’s start!
Before our journey starts, let’s prepare our data and join the informative tables.
For further analysis, we need to merge all the tables described above to build the training dataset.
Currently, only air_visit_data contains our target variable, visitors. Therefore, we start from this table and join the other informative tables we are interested in.
First, we merge three air system related tables together.
air_reserve_sum <- air_reserve %>%
mutate(visit_date = as.Date(visit_datetime)) %>%
group_by(air_store_id, visit_date) %>%
summarise(visitor_sum = sum(reserve_visitors))

air_data <- left_join(air_visit_data, air_reserve_sum, by = c("air_store_id" = "air_store_id", "visit_date" = "visit_date"))
air_data[is.na(air_data)] <- 0
air_data <- air_data %>%
left_join(air_store_info, by = "air_store_id") %>%
mutate(reserve_visitor = visitor_sum) %>%
select(-visitor_sum)
air_data

The table store_id_relation matches stores that appear in both the Air and HPG systems. There are 829 stores in the Air system and 4690 stores in the HPG system, but only 150 stores can be matched, which accounts for about 18% of the stores in the Air system.
paste("There are", nrow(air_store_info), "stores in Air system.")

## [1] "There are 829 stores in Air system."

paste('There are', nrow(hpg_store_info), 'stores in HPG system.')

## [1] "There are 4690 stores in HPG system."

paste('There are',nrow(store_id_relation),'stores in both system.')

## [1] "There are 150 stores in both system."
If we merge the reservation information of these 150 stores into the Air system dataset, we might introduce bias, because we cannot simply assume that the unmatched stores have no other reservations (other reservation systems might exist in Japan). Thus, we decided to focus on the Air system dataset and set the HPG data aside for now.
In reality, people tend to go to restaurants on weekends and holidays, which suggests that date information can be a useful explanatory variable. Consequently, we join the date information into our dataset.
air_data <- air_data %>%
left_join(date_info, by = c('visit_date'='calendar_date'))

Similar to date, weather influences visitor numbers as well. As mentioned before, we are allowed to extract external weather data from the Japan Meteorological Agency.
We collected weather-related data from January 2016 to April 2017 for the restaurant areas. To choose the right weather data for each store, we first identify the nearest station using longitude and latitude, and then merge the station’s weather information into the dataset.
Credit to Hunter McGushion: https://www.kaggle.com/huntermcgushion/exhaustive-weather-eda-file-overview/data
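The nearest-station file we use was precomputed, but the underlying idea can be sketched: compute the great-circle (haversine) distance from a store to each candidate station and keep the closest. The coordinates below are made-up illustration values, not taken from the actual station file:

```r
# Great-circle distance between two points given in decimal degrees.
haversine_km <- function(lat1, lon1, lat2, lon2) {
  rad <- pi / 180
  R <- 6371  # mean Earth radius in km
  dlat <- (lat2 - lat1) * rad
  dlon <- (lon2 - lon1) * rad
  a <- sin(dlat / 2)^2 + cos(lat1 * rad) * cos(lat2 * rad) * sin(dlon / 2)^2
  2 * R * asin(sqrt(pmin(1, a)))
}

store_lat <- 34.695; store_lon <- 135.198        # illustrative store location
stations <- data.frame(lat = c(34.70, 35.66),    # two made-up candidate stations
                       lon = c(135.21, 139.75))

d <- haversine_km(store_lat, store_lon, stations$lat, stations$lon)
nearest <- which.min(d)  # index of the closest station
```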
weather_station <- weather %>%
left_join(station, by = 'station_id')

final_data <- left_join(air_data, weather_station, by = c("air_store_id" = "air_store_id", "visit_date" = "visit_date")) %>%
select(-X1) %>%
select(- c('station_id':'station_great_circle')) %>%
mutate(air_genre_name = air_genre_name.x) %>%
mutate(air_area_name = air_area_name.x) %>%
mutate(latitude = latitude.x) %>%
mutate(longitude = longitude.x) %>%
select(- c('air_genre_name.x':'longitude.x'))
paste('Now there are', nrow(final_data), 'observations with', ncol(final_data), 'variables.')

## [1] "Now there are 252108 observations with 24 variables."
Data preparation done! To gather all the information we need, we joined five tables; our dataset now contains 252108 observations and 24 variables.
Now let’s work on data cleaning - missing values!
plot_missing(final_data)
From this plot, we can see each variable’s missing percentage. Since more than 95% of the values in deepest_snowfall and total_snowfall are missing, we decided to drop both variables. In addition, we also dropped weather variables with more than 25% missing values, because weather is dynamic and hard to impute reliably; imputing them might introduce other problems and mislead the analysis.
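The drop decision here (remove any column whose missing fraction exceeds a chosen threshold) can be written as a small reusable helper rather than a hand-picked select(). A base-R sketch on a toy data frame:

```r
# Keep only columns whose fraction of NAs is at or below `threshold`.
drop_mostly_missing <- function(df, threshold = 0.25) {
  miss_frac <- colMeans(is.na(df))        # fraction of NA per column
  df[, miss_frac <= threshold, drop = FALSE]
}

# toy example: column "a" is 75% missing and gets dropped, "b" is kept
toy <- data.frame(a = c(1, NA, NA, NA), b = 1:4)
cleaned <- drop_mostly_missing(toy)
names(cleaned)  # "b"
```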
final_data <- final_data %>%
select(- avg_sea_pressure, -avg_local_pressure, -avg_humidity, -avg_vapor_pressure, -cloud_cover, -precipitation, -solar_radiation, -deepest_snowfall, -total_snowfall)

For columns with less than 25% missing values, we replace each missing value with the mean across other stores on the same date.
missing_value <- final_data %>%
group_by(visit_date) %>%
summarise(hightemp_mean = mean(high_temperature, na.rm = TRUE),
lowtemp_mean = mean(low_temperature, na.rm = TRUE),
avgtemp_mean = mean(avg_temperature, na.rm = TRUE),
wind_speed_mean = mean(avg_wind_speed, na.rm = TRUE),
sunlight_mean = mean(hours_sunlight, na.rm = TRUE))

final_data <- final_data %>%
left_join(missing_value, by = 'visit_date') %>%
mutate(high_temperature = case_when(is.na(high_temperature) ~ hightemp_mean,
!is.na(high_temperature) ~ high_temperature)) %>%
mutate(low_temperature = case_when(is.na(low_temperature) ~ lowtemp_mean,
!is.na(low_temperature) ~ low_temperature)) %>%
mutate(avg_temperature = case_when(is.na(avg_temperature) ~ avgtemp_mean,
!is.na(avg_temperature) ~ avg_temperature)) %>%
mutate(avg_wind_speed = case_when(is.na(avg_wind_speed) ~ wind_speed_mean,
!is.na(avg_wind_speed) ~ avg_wind_speed)) %>%
mutate(hours_sunlight = case_when(is.na(hours_sunlight) ~ sunlight_mean,
!is.na(hours_sunlight) ~ hours_sunlight)) %>%
select(-hightemp_mean, -lowtemp_mean, -avgtemp_mean, -wind_speed_mean, -sunlight_mean)
final_data

plot_missing(final_data)
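As an aside, the case_when pattern used for the imputation (“if missing, use the date mean; otherwise keep the value”) is equivalent to a simple vectorized fill. A base-R sketch on toy vectors:

```r
# Replace NAs in x with the aligned fallback values (e.g. per-date means).
impute_with <- function(x, fallback) ifelse(is.na(x), fallback, x)

x <- c(10, NA, 30)   # a column with a gap
m <- c(11, 22, 33)   # per-date means aligned with x
filled <- impute_with(x, m)
filled               # 10 22 30 - only the NA is replaced
```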
Now there are no missing values; our data is complete. Let’s step into the EDA part.
As a first step of EDA, let’s see what the data looks like overall.
introduce(final_data)

plot_intro(final_data)
We have 252,108 records and 15 variables, 14 of which are predictors. There are 5 discrete columns, 10 continuous columns, and no missing columns or values.
What are the variables we have now?
names(final_data)

## [1] "air_store_id" "visit_date" "visitors"
## [4] "reserve_visitor" "day_of_week" "holiday_flg"
## [7] "avg_temperature" "high_temperature" "low_temperature"
## [10] "hours_sunlight" "avg_wind_speed" "air_genre_name"
## [13] "air_area_name" "latitude" "longitude"
From the column names, it’s obvious that we have two types of variables:
1). AirREGI-related variables: variables provided by AirREGI, including air_store_id, visit_date, visitors, reserve_visitor, day_of_week, holiday_flg, air_genre_name, air_area_name, latitude, longitude.
2). Weather-related variables: variables providing weather information, including avg_temperature, high_temperature, low_temperature, hours_sunlight, and avg_wind_speed.
We will explore them separately.
First, let’s get a general picture of our target variable - visitors.
air_visits <- final_data %>%
ggplot(aes(visitors)) +
geom_vline(xintercept = 20, color = "Red")+
geom_histogram(bins = 30, alpha = 0.8) +
scale_x_log10()+
labs(title = "Distribution of Visitors")+
theme(plot.title = element_text(hjust = 0.5))
air_visits_outlier <-
ggplot(final_data)+
geom_boxplot(aes(x = "", y = visitors))+
labs(x = "",
y = "visitors",
title = "Boxplot of Visitors")+
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(air_visits, air_visits_outlier, ncol = 2)

From the graph above, we can see that the number of daily guests per restaurant peaks at around 20 (the red line). The distribution extends up to 100 and, in rare cases, beyond. To take a closer look at those rare cases, we use a boxplot to find the outliers. A few restaurants had over 750 visitors on a single day.
Since this is panel data, let’s see how our target variable is distributed across dates.
final_data %>%
group_by(visit_date)%>%
summarise(visitors = sum(visitors))%>%
ggplot(aes(x = visit_date, y = visitors))+
geom_line(color = 'darkred')
Based on the line graph, there is an obvious rise in the count of records after July 2016. We infer that this is due to wider adoption of the AirREGI service, with more restaurants joining after July 2016.
After July 2016, the distribution across visit dates is fairly even, except for an obvious dip around the Japanese holiday week known as the “Golden Week” (late April to early May), as mentioned in the official description.
Then let’s see the distribution of daily visitors per restaurant:
There are many variables we are interested in, and some of them might be highly related to our target variable. Before exploring their relationships with visitors, let’s explore the other variables first. We separate the variables into three categories and examine each in turn.
First, let’s discover where these 829 restaurants are located.
Here we can see that most restaurants are located along the coastline, which makes sense since coastal areas are more developed in Japan. (You can interact with the map)
pref <- as.data.frame(str_split_fixed(air_store_info$air_area_name, " ", 2))
air_store <- cbind(air_store_info, pref[c(1)])%>%
rename(pref = V1)
# create the labels
labels <- sprintf(
"Store ID: %s</br>Prefecture:%s</br>Genre: %s", air_store$air_store_id, air_store$pref, air_store$air_genre_name) %>%
lapply(HTML)
air_store %>%
leaflet() %>%
addProviderTiles("CartoDB.Positron") %>%
addCircleMarkers(~ longitude,
~ latitude,
radius = 4,
label = labels,
clusterOptions = markerClusterOptions())

# ref: https://www.kaggle.com/captcalculator/a-very-extensive-recruit-exploratory-analysis

top8 <- final_data %>%
group_by(air_area_name) %>%
count() %>%
ungroup() %>%
top_n(8,n) %>%
ggplot(aes(reorder(air_area_name, n, FUN = min) ,n, fill = air_area_name))+
geom_col() +
theme(legend.position = "none")+
coord_flip() +
labs(x = "Top 8 areas", y = "Number of air restaurants")+
scale_fill_brewer(palette = "Spectral")+
theme(plot.title = element_text(hjust = 0.5),
panel.background = element_rect(fill = 'white'),
panel.grid = element_line(color = 'grey90'))
top8

Here we list only the top 8 areas with the most restaurants. Among them, Fukuoka has the largest number of restaurants, followed by several Tokyo areas.
What are the types of these restaurants?
# Plot the pie chart for different genres.
# distinct_final_data: one row per restaurant (assumed construction,
# since it is not defined elsewhere in this report)
distinct_final_data <- final_data %>% distinct(air_store_id, .keep_all = TRUE)
ggplot(data = distinct_final_data, mapping=aes(x="Genre",fill=air_genre_name))+
geom_bar(stat="count",width=0.5,position='stack')+
coord_polar("y", start=0)+
labs(title = "Piechart of the types of restaurants")+
theme(plot.title = element_text(hjust = 0.5))+
labs(x = '',y = '')

# Calculate the ratios of the Top 3 genres.
Izakaya_ratio <- nrow(subset(distinct_final_data, air_genre_name == "Izakaya"))/nrow(distinct_final_data)
paste("Izakaya accounts for", Izakaya_ratio, "of total number of restaurants")

## [1] "Izakaya accounts for 0.237635705669481 of total number of restaurants"
Cafe_Sweets_ratio <- nrow(subset(distinct_final_data, air_genre_name == "Cafe/Sweets"))/nrow(distinct_final_data)
paste("Cafe/Sweets accounts for", Cafe_Sweets_ratio, "of total number of restaurants")

## [1] "Cafe/Sweets accounts for 0.218335343787696 of total number of restaurants"
Dining_Bar_ratio <- nrow(subset(distinct_final_data, air_genre_name == "Dining bar"))/nrow(distinct_final_data)
paste("Dining bar accounts for", Dining_Bar_ratio, "of total number of restaurants")

## [1] "Dining bar accounts for 0.130277442702051 of total number of restaurants"

From the pie chart, we can observe that the top 3 genres of restaurants in AirREGI are Izakaya, Cafe/Sweets, and Dining bar, with shares of 23.76%, 21.83%, and 13.03%, respectively.
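The three separate nrow(subset(...)) ratio calculations can be collapsed into a single step with table() and prop.table(). A sketch on toy genre labels (not the actual store data):

```r
# toy genre labels standing in for distinct_final_data$air_genre_name
genres <- c("Izakaya", "Izakaya", "Cafe/Sweets", "Dining bar")

# counts per genre, converted to proportions of the total in one call
shares <- prop.table(table(genres))
round(shares, 2)
```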
After understanding the restaurants, let’s explore the time period of our dataset.
time_data <- final_data
setDT(time_data)[, Month_Yr := format(as.Date(visit_date), "%Y-%m") ]
setDT(time_data)[, Year := format(as.Date(visit_date), "%Y") ]
period <- time_data%>%
group_by(Month_Yr)%>%
ggplot(aes(x=Month_Yr, fill = Year))+
geom_bar(alpha=0.8)+
scale_fill_brewer(palette = "Set1")+
theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8))+
labs(x = 'Month-Year',
y = 'Frequency',
title = 'Time Span')+
theme(plot.title = element_text(hjust = 0.5))
period

From this plot, we can clearly see that our dataset spans January 2016 to April 2017. Moreover, the Air system has more data collected from July 2016 onward. As noted above, we attribute this to wider adoption of the AirREGI service after July 2016.
Now let’s break the time period down by day of week to see whether we can find any fluctuations.
most_day <- ggplot(final_data, aes(day_of_week))+
geom_bar(alpha=0.8,fill = "blue4")+
xlab("day of week") +
ggtitle("Bar chart for Day of Week")+
theme(axis.text.x = element_text(angle = 60, hjust = 1))+
theme(plot.title = element_text(hjust = 0.5))
most_dayFrom the bar chart of days of week, we can learn that Friday is the most frequent day that people would choose to go out, and then Saturday and Thursday. Thursday is quite a surprising result since it’s a usual business day.
Next, let’s take closer look to Holidays.
# distinct_date_data: one row per calendar date with a Year column
# (assumed construction, since it is not defined elsewhere in this report)
distinct_date_data <- date_info %>%
  mutate(Year = format(calendar_date, "%Y"))
distinct_date_data$holiday_flg <- factor(distinct_date_data$holiday_flg, levels = c("0", "1"))
holidayOrNot <- distinct_date_data %>%
ggplot(aes(x=holiday_flg, fill =holiday_flg)) +
geom_bar(alpha = 0.8) +
theme(legend.position = "none")+
facet_wrap(~Year)+
scale_fill_brewer(palette = "Set1")
holidayOrNot

From this graph we can see that in both 2016 and 2017, the number of holidays (1) is much smaller than the number of non-holidays (0).
The counts for 2017 are much lower than for 2016 because the 2017 data covers only four months.
Since three of the five weather-related variables concern temperature, we pick temperature for detailed exploration. First, let’s look at how the temperature behaves in Japan over the year.
# weather_data: final_data with Month_Yr and prefecture columns added
# (assumed construction, since it is not defined elsewhere in this report)
weather_data <- final_data %>%
  mutate(Month_Yr = factor(format(visit_date, "%Y-%m")),
         prefecture = str_split_fixed(air_area_name, " ", 2)[, 1])
avg_temp <- weather_data %>%
ggplot(aes(x = avg_temperature , y = fct_rev(Month_Yr), fill = ..x..)) +
geom_density_ridges_gradient(scale = 3, rel_min_height = 0.01, gradient_lwd = 1., bandwidth = 1.4) +
scale_fill_viridis(name = "T_avg [°C]", option = "C") +
ggtitle("Average temperature of restaurant areas in 2016/01-2017/04") +
labs(x = "Average temperature", y = "") +
theme_ridges(font_size = 13, grid = TRUE) +
theme(axis.title.y = element_blank())+
theme(plot.title = element_text(hjust = 0.5))
avg_temp

# ref: https://www.kaggle.com/headsortails/be-my-guest-recruit-restaurant-eda

From the avg_temperature plot above, we can see that, on average, there is a range of about 20 °C between winter and summer, and within a typical month the temperature spreads across roughly 10 °C. Moreover, the average temperature in Japan stays above 0 °C almost all year round. The average temperature peaks near 30 °C between July and August 2016.
Let’s see the distribution of other weather factors.
weather_related <- weather_data %>%
select(-visitors, - avg_temperature, -high_temperature, -low_temperature)
plot_histogram(weather_related, ggtheme = theme_bw())

weather_monthly_by_fec <- weather_data %>%
group_by(prefecture, Month_Yr)%>%
summarise(monthly_visitors = sum(visitors),
avg_temperature = mean(avg_temperature),
avg_high_temperature = mean(high_temperature),
avg_low_temperature = mean(low_temperature),
avg_hours_sunlight = mean(hours_sunlight),
avg_wind_speed =mean(avg_wind_speed))
ggplot(weather_monthly_by_fec)+
geom_point(aes(x = Month_Yr, avg_temperature),color="red", size= 1)+
geom_point(aes(x = Month_Yr, avg_wind_speed),color="blue",size= 1)+
geom_point(aes(x = Month_Yr, avg_hours_sunlight),color="purple",size= 0.8)+
theme(axis.text.x = element_text(angle = 60, hjust = 1, size = 6))+
facet_wrap(~prefecture)
Well, unfortunately, wind speed and hours of sunlight show no clear differences across months of the year. In other words, sunlight and wind speed may contribute little to predicting the number of visitors.
(1). Does the type of restaurants influence the amount of visitors?
Here, we analyze the relationship between air_genre_name and the number of visitors.
genre_visit <- final_data %>%
group_by(air_genre_name)%>%
summarise(visitors = sum(visitors))
most_genre <- ggplot(final_data, aes(air_genre_name))+
geom_bar(alpha=0.6,fill = "#338658")+
xlab("air genre name") +
ggtitle("Bar chart for Air Genre Name")+
theme(axis.text.x = element_text(angle = 60, hjust = 1))+
theme(plot.title = element_text(hjust = 0.5))
most_visitor_genre <- ggplot(genre_visit, aes(x = air_genre_name, visitors))+
geom_bar(stat = "identity",fill="#c1c084")+
scale_colour_hue(c=45, l=80)+
xlab("air genre name") +
ylab("Visitors") +
ggtitle("Different Genres vs. Visitors")+
theme(axis.text.x = element_text(angle = 60, hjust = 1))+
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(most_genre, most_visitor_genre, ncol = 2)
From the two plots above, we can observe that the frequency distribution of air_genre_name and the distribution of total visitors across genres are similar. Izakaya has both the highest frequency in the dataset and the most total visitors, while International Cuisine has the lowest frequency and the fewest total visitors. Overall, Izakaya restaurants account for the largest share of visits.
(2). Does the location have an impact on popularity?
Here, we conduct analysis between longitude, latitude, and the amount of visitors.
# The amount of visitors VS Longitude
ggplot(final_data, aes(x = longitude, y = visitors))+
geom_point(color = "darkred", alpha=0.5, size = 0.7)

# The amount of visitors VS Latitude
ggplot(final_data, aes(x = latitude, y = visitors))+
geom_point(color = "darkred", alpha=0.5, size = 0.7)
From the two plots above, we can conclude that there is no obvious relationship between the number of visitors and longitude. However, lower latitudes seem to be associated with more visitors, suggesting that restaurants in lower-latitude areas are more popular. That said, this relationship may be spurious due to confounding variables, such as the level of development and weather. We will dig into the weather-related variables in Part II.
General view: how does the number of visitors vary according to different visit dates?
# The relationship between visit_date and the amount of visitors.
ggplot(final_data) +
geom_point(aes(x = visit_date, y = visitors), alpha = 0.5, color = "blue3", size = 0.7)+
labs(x = "Visit Date",
y = "Count of Visitors",
title = "How does the number of visitors vary among different dates?")+
theme(plot.title = element_text(hjust = 0.5))
The scatter plot shows that, for most restaurants, daily visitors stay below 125 from January 2016 to April 2017; only five outliers exceed 500 visitors.
Visitors VS. Week Days: Which day of week do visitors prefer to eat out?
week_visit <- final_data %>%
group_by(day_of_week)%>%
summarise(visitors = sum(visitors))
most_day <- ggplot(final_data, aes(day_of_week))+
geom_bar(alpha=0.8,fill = "#338658")+
xlab("day of week") +
ggtitle("Bar chart for Day of Week")+
theme(axis.text.x = element_text(angle = 60, hjust = 1))+
theme(plot.title = element_text(hjust = 0.5))
most_visitor <- ggplot(week_visit, aes(x = day_of_week, visitors))+
geom_bar(stat = "identity",fill="#c1c084")+
scale_colour_hue(c=45, l=80)+
xlab("day of week") +
ylab("Visitors") +
ggtitle("Bar chart between week days and visitors")+
theme(axis.text.x = element_text(angle = 60, hjust = 1))+
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(most_day, most_visitor, ncol = 2)
In the two plots above, the left one shows the distribution of days of the week: Friday is the day on which people most often choose to eat out. The right one shows the total visitors for each day of the week: Saturday clearly has the most visitors overall. Combining the two is interesting, because restaurant visits are most frequent on Friday, yet total visitors peak on Saturday. Is this mismatch due to extreme values? We investigate next.
First, we are going to find those extreme values, namely outliers.
ggplot(final_data)+
geom_boxplot(aes(x = "", y = visitors))+
labs(x = "",
y = "visitors",
title = "Distribution of Visitors")
The box plot shows only a handful of restaurant-days with more than 250 visitors relative to the whole dataset. We remove these outliers to exclude the impact of extreme values.
final_data_sub <- subset(final_data, visitors < 250)
Then, we analyze business days and weekends separately.
# The distributions of visitors of business days.
business_day <- subset(final_data_sub, day_of_week %in% c("Monday","Tuesday","Wednesday","Thursday","Friday"))
library(ggthemes)  # provides theme_wsj() and scale_fill_wsj()
ggplot(business_day,aes(visitors,fill = day_of_week))+
geom_histogram()+
facet_wrap(~ day_of_week)+
theme_wsj()+
scale_fill_wsj()+
guides(fill=guide_legend(title=NULL))
# The distributions of visitors on weekends.
weekend <- subset(final_data_sub, day_of_week %in% c("Saturday","Sunday"))
ggplot(weekend,aes(visitors,fill = day_of_week))+
geom_histogram()+
facet_wrap(~ day_of_week)+
theme_wsj()+
scale_fill_wsj()+
guides(fill=guide_legend(title=NULL))
Apart from that, we also plot the overall distribution of visitors by day of week, combining business days and weekends.
day_eat_out_ratio <-
ggplot(final_data_sub, aes(x= visitors, fill = day_of_week)) +
geom_histogram()+
labs(x = "visitors",
y = "count",
fill = "day of week",
title = "Which day of week do visitors prefer to eat out?")+
scale_fill_brewer(palette = "Spectral")+
theme(plot.title = element_text(hjust = 0.5))
day_eat_out_ratio
Based on the four graphs above, we confirm that even with extreme values excluded the result is the same: Friday is the most frequent day for eating out, while Saturday has the most total visitors.
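The mismatch can also be checked directly by comparing the mean visitors per restaurant-day, which normalizes the totals by how many observations fall on each weekday. A minimal sketch, assuming final_data_sub from above:

```r
# Mean visitors per restaurant-day, by day of week (sketch).
# Assumes final_data_sub has columns day_of_week and visitors.
avg_by_day <- final_data_sub %>%
  group_by(day_of_week) %>%
  summarise(n_obs = n(),                     # how many restaurant-days
            total_visitors = sum(visitors),  # the quantity in the bar chart
            mean_visitors = mean(visitors)) %>%
  arrange(desc(mean_visitors))
avg_by_day
```

If Saturday's mean exceeds Friday's, the higher Saturday total is driven by more visitors per open day rather than simply by more observations.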
Visitors VS. Holidays: Does the number of visitors differ between holidays and non-holidays?
Next, we are curious whether holidays and non-holidays draw different numbers of visitors.
To do this, we first convert holiday_flg to a factor and relabel '0' as 'nonholiday' and '1' as 'holiday'.
final_data$holiday_flg <- as.factor(final_data$holiday_flg)
levels(final_data$holiday_flg) <- c('nonholiday', 'holiday')
table(final_data$holiday_flg)
##
## nonholiday holiday
## 239333 12775
paste('Now there are', nrow(subset(final_data, holiday_flg == "nonholiday")), 'nonholidays and', nrow(subset(final_data, holiday_flg == "holiday")), 'holidays')
## [1] "Now there are 239333 nonholidays and 12775 holidays"
ggplot(data = final_data) +
geom_histogram(aes(x = visitors, fill = holiday_flg), position = "dodge")+
labs(x = "visitors",
y = "count",
fill = "holiday or not",
title = "Is the amount of visitors different between holidays and non-holidays?")+
scale_fill_manual(values = alpha(c("blue", "red", "yellow"), .6))
There is a huge difference in the total number of visitors between holidays and non-holidays. However, this gap may simply reflect the very different numbers of holiday and non-holiday observations. We need a deeper analysis: visitors per holiday versus per non-holiday.
# Visitors per holiday
total_visitors <- final_data %>%
group_by(holiday_flg)%>%
summarise(visitors = sum(visitors))
visitors_per_holiday <- total_visitors[2,2]/nrow(subset(final_data, holiday_flg == "holiday"))
paste('There are', visitors_per_holiday, 'visitors per holiday')
## [1] "There are 23.7033268101761 visitors per holiday"
# Visitors per non-holiday
visitors_per_non_holiday <- total_visitors[1,2]/nrow(subset(final_data, holiday_flg == "nonholiday"))
paste('There are', visitors_per_non_holiday, 'visitors per non-holiday')
## [1] "There are 20.8280638273869 visitors per non-holiday"
Thus, once we control for the number of days, there are actually more visitors during holidays, since visitors per holiday exceed visitors per non-holiday. We infer that Japanese people prefer to spend holidays dining out with their families, as in many cultures around the world.
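The same per-day figures can be computed in one pipeline with mean(), avoiding the manual division and tibble indexing above; a sketch assuming final_data with the relabeled holiday_flg:

```r
# Visitors per day, by holiday status (sketch).
final_data %>%
  group_by(holiday_flg) %>%
  summarise(days = n(),                       # observations per group
            visitors_per_day = mean(visitors))  # sum(visitors) / n()
```

mean() here is exactly total visitors divided by the number of observations in each group, so the results should match the manual computation.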
In the first part of data transformation, we cleaned the data before conducting the EDA. In this part, we transform several useful variables based on our EDA findings to capture more information and improve the performance of our model.
In the EDA, we found that Friday, Saturday, and Sunday usually have more visitors, which demonstrates the importance of the weekend in our setting. Therefore, we create an indicator variable identifying whether a date falls on a weekend.
final_data <- final_data %>%
mutate(is_weekend = case_when(day_of_week == "Saturday" ~ "weekend",
day_of_week == "Sunday" ~ "weekend",
day_of_week == "Monday" ~ "notweekend",
day_of_week == "Tuesday" ~ "notweekend",
day_of_week == "Wednesday" ~ "notweekend",
day_of_week == "Thursday" ~ "notweekend",
day_of_week == "Friday" ~ "notweekend"))
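Since day_of_week takes only seven values, the same flag can be built more compactly with %in% and if_else(); an equivalent sketch:

```r
# Equivalent, shorter weekend flag (sketch).
final_data <- final_data %>%
  mutate(is_weekend = if_else(day_of_week %in% c("Saturday", "Sunday"),
                              "weekend", "notweekend"))
```

Both versions produce the same column; the case_when() form just keeps every mapping explicit.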
final_data
Generally speaking, people are more likely to go out during holidays, and for some special holidays they go out even the day before or after. Therefore, we create two variables identifying whether the previous or the next day is a holiday.
holiday <- final_data %>%
select(air_store_id, visit_date) %>%
mutate(dayminus = visit_date - 1)
holiday <- holiday %>%
left_join(date_info, by = c("dayminus" = "calendar_date")) %>%
mutate(pres_holiday = holiday_flg) %>%
select(-holiday_flg, - day_of_week, - dayminus)
final_data <- left_join(holiday,final_data, by = c("visit_date" = "visit_date", "air_store_id" = "air_store_id"))
holiday <- final_data %>%
select(air_store_id, visit_date) %>%
mutate(dayplus = visit_date + 1)
holiday <- holiday %>%
left_join(date_info, by = c("dayplus" = "calendar_date")) %>%
mutate(next_holiday = holiday_flg) %>%
select(-holiday_flg, - day_of_week, - dayplus)
final_data <- left_join(holiday,final_data, by = c("visit_date" = "visit_date", "air_store_id" = "air_store_id"))
final_data
As we saw in the EDA part, there is a certain pattern of visitors within each month. Our guess is that it relates to when people get paid. To capture this underlying trend, we create a new variable for the day of the month, and a month variable to differentiate months.
final_data <- final_data %>%
mutate(month = month(visit_date))
final_data <- final_data %>%
mutate(day_of_month = mday(visit_date))
The variable air_area_name shows the area where each store is located. There are currently 103 distinct areas in our dataset. Inspecting the values, we realized that the first one or two words indicate the city the store is in. To avoid dummy-encoding all 103 areas, which might lead to overfitting, we extract the city from this column and drop the remaining redundant information.
final_data %>%
group_by(air_area_name) %>%
summarise(sum(visitors))
final_data <- final_data %>%
mutate(city = case_when(grepl('Tōkyō', air_area_name) ~'Tōkyō',
grepl('Fukuoka',air_area_name) ~'Fukuoka',
grepl('Hiroshima',air_area_name) ~'Hiroshima',
grepl('Hokkaidō',air_area_name) ~'Hokkaidō',
grepl('Hyōgo',air_area_name) ~'Hyōgo',
grepl('Miyagi',air_area_name) ~'Miyagi',
grepl('Niigata',air_area_name) ~'Niigata',
grepl('Ōsaka',air_area_name) ~'Ōsaka',
grepl('Shizuoka',air_area_name) ~'Shizuoka'))
After finishing feature engineering, we convert the categorical variables to factors.
final_data$is_weekend <- as.factor(final_data$is_weekend)
final_data$month <- as.factor(final_data$month)
final_data$city <- as.factor(final_data$city)
final_data$day_of_month <- as.factor(final_data$day_of_month)
final_data$pres_holiday <- as.factor(final_data$pres_holiday)
final_data$next_holiday <- as.factor(final_data$next_holiday)
final_data$holiday_flg <- as.factor(final_data$holiday_flg)
final_data$day_of_week <- as.factor(final_data$day_of_week)
final_data$air_genre_name <- as.factor(final_data$air_genre_name)
final_data$air_area_name <- as.factor(final_data$air_area_name)
And then, we can start to train our learners.
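As an aside, the repeated as.factor() assignments above can be collapsed with dplyr's across() (available in dplyr 1.0 and later); an equivalent sketch:

```r
# Equivalent, more compact factor conversion (sketch).
final_data <- final_data %>%
  mutate(across(c(is_weekend, month, city, day_of_month, pres_holiday,
                  next_holiday, holiday_flg, day_of_week,
                  air_genre_name, air_area_name),
                as.factor))
```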
Firstly, we split our data into training data and test data. Also, we perform cross validation on the training data.
set.seed(1067)
model_data <- final_data %>%
select(-air_store_id, -visit_date, - air_area_name) #drop store_id, visit date
model_data <- na.omit(model_data)
model_data
data_split <- initial_split(model_data)
air_train <- training(data_split)
air_test <- testing(data_split)
cv_splits <- vfold_cv(air_train, v = 10)
Then, we create a recipe for the final dataset, performing some last transformations to facilitate training. Since our target variable visitors is right-skewed, we log-transform it. We also use step_other to group the sparse levels of air_genre_name.
rec_mod <- recipe(visitors ~ .,
data = air_train) %>%
step_other(air_genre_name, threshold = 0.05) %>%
step_scale(all_numeric(),-visitors) %>%
step_center(all_numeric(),-visitors) %>%
step_dummy(all_nominal()) %>%
step_log(visitors)
In this part, we try three different models: gradient boosted trees, a neural network, and MARS.
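Before training, it can help to verify the recipe does what we expect by prepping it on the training set and baking a few rows; a sketch assuming rec_mod and air_train from above:

```r
# Sanity-check the recipe: estimate the steps on the training data
# and inspect a few processed rows (sketch).
rec_prepped <- prep(rec_mod, training = air_train)
baked <- bake(rec_prepped, new_data = head(air_train))
glimpse(baked)  # visitors should now be on the log scale; nominals dummied
```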
The first learner we are using to train the model is Gradient Boosted Trees.
ctrl <- trainControl(method = 'cv', number = 10,
savePredictions = "final",
verboseIter = TRUE)
Define the tuning grid.
grid_gbm <- expand.grid(interaction.depth = seq(20, 40, by = 5), #max tree depth
n.trees = 100, #boosting iteration with 100 trees
shrinkage = 0.1,
n.minobsinnode = 10)
set.seed(2651)
cl <- makeCluster(8)
registerDoParallel(cl)
gbm_mod <- train(rec_mod,
data = air_train,
method = 'gbm',
metric = "RMSE",
tuneGrid = grid_gbm,
trControl = ctrl,
verbose = FALSE)
## Loading required namespace: gbm
stopCluster(cl)
First, let's take a quick look at the fitted model.
gbm_mod
## Stochastic Gradient Boosting
##
## 48981 samples
## 17 predictor
##
## Recipe steps: other, scale, center, dummy, log
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 44083, 44082, 44082, 44083, 44082, 44082, ...
## Resampling results across tuning parameters:
##
## interaction.depth RMSE Rsquared MAE
## 20 0.7909566 0.2508622 0.6087193
## 25 0.7893063 0.2533031 0.6067116
## 30 0.7894246 0.2526454 0.6068775
## 35 0.7901412 0.2509063 0.6072874
## 40 0.7901631 0.2506715 0.6068605
##
## Tuning parameter 'n.trees' was held constant at a value of 100
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 100,
## interaction.depth = 25, shrinkage = 0.1 and n.minobsinnode = 10.
As the graph below shows, the optimal interaction depth for our learner is 25.
ggplot(gbm_mod)
Then, we make our prediction on the test set. Since we log-transformed the target during training, we need to exponentiate the predictions to return them to the original scale. As the result below shows, the test-set RMSE is about 15.24.
gbm_test <- predict(gbm_mod, newdata = air_test)
gbm_result <- postResample(pred = exp(gbm_test), obs = air_test$visitors)
print(gbm_result)
## RMSE Rsquared MAE
## 15.2425176 0.2581045 10.2391292
And let’s draw out the prediction graph.
gbm_final <- data.frame(cbind(actuals = air_test$visitors, predicteds = exp(gbm_test)))
ggplot(gbm_final, aes(x = actuals, y = predicteds))+
geom_point()+
geom_smooth(method = 'lm', se = FALSE)+
geom_abline(color = 'red')
The second learner we use is a single-hidden-layer neural network (nnet).
nn_grid <- expand.grid(size = c(2, 3, 4, 5,6,7),
decay = c(10^(-1),10^(-2)))
set.seed(2651)
cl <- makeCluster(8)
registerDoParallel(cl)
set.seed(3555)
nn_mod <- train(
rec_mod,
data = air_train,
method = 'nnet',
maxit = 1000,
tuneGrid = nn_grid,
allowParallel = TRUE,
trControl = ctrl,
linout = TRUE)
stopCluster(cl)
Let's take a quick look at the fitted model.
nn_mod
## Neural Network
##
## 48981 samples
## 17 predictor
##
## Recipe steps: other, scale, center, dummy, log
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 44083, 44083, 44082, 44083, 44082, 44084, ...
## Resampling results across tuning parameters:
##
## size decay RMSE Rsquared MAE
## 2 0.01 0.8349831 0.1625336 0.6513121
## 2 0.10 0.8283387 0.1761962 0.6446613
## 3 0.01 0.8213797 0.1901150 0.6383998
## 3 0.10 0.8226068 0.1876938 0.6387706
## 4 0.01 0.8155422 0.2015743 0.6330635
## 4 0.10 0.8183054 0.1962197 0.6350342
## 5 0.01 0.8173095 0.1983321 0.6339778
## 5 0.10 0.8174302 0.1980274 0.6337498
## 6 0.01 0.8158838 0.2012314 0.6325442
## 6 0.10 0.8164725 0.1999821 0.6332021
## 7 0.01 0.8182125 0.1971201 0.6355124
## 7 0.10 0.8127042 0.2073228 0.6293602
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 7 and decay = 0.1.
Then, we plot the tuning process of our neural network model. As shown below, the best number of hidden units is 4 for weight decay 0.01 and 7 for weight decay 0.1. Since neural networks take a long time to train, we tuned over a limited parameter range, so we may not have found the best possible configuration.
ggplot(nn_mod)
Then, we make our prediction on the test set. As the result below shows, the test-set RMSE is about 15.99.
nn_test <- predict(nn_mod, newdata = air_test)
nn_result <- postResample(pred = exp(nn_test), obs = air_test$visitors)
print(nn_result)
## RMSE Rsquared MAE
## 15.9911671 0.1844395 10.6039848
Then, we draw out the prediction plot for our neural network learner.
nn_final <- data.frame(cbind(actuals = air_test$visitors, predicteds = exp(nn_test)))
ggplot(nn_final, aes(x = actuals, y = predicteds))+
geom_point()+
geom_smooth(method = 'lm', se = FALSE)+
geom_abline(color = 'red')
The third learner we are using to train the model is MARS.
cl <- makeCluster(8)
registerDoParallel(cl)
set.seed(2651)
mars_mod <- train(
rec_mod,
data = air_train,
method = "gcvEarth",
tuneGrid = data.frame(degree=1:2),
trControl = ctrl
)
## Loading required namespace: earth
## Loading required package: earth
## Loading required package: plotmo
## Loading required package: plotrix
##
## Attaching package: 'plotrix'
## The following object is masked from 'package:scales':
##
## rescale
## Loading required package: TeachingDemos
stopCluster(cl)
Let's take a quick look at the fitted model.
mars_mod
## Multivariate Adaptive Regression Splines
##
## 48981 samples
## 17 predictor
##
## Recipe steps: other, scale, center, dummy, log
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 44083, 44082, 44082, 44083, 44082, 44082, ...
## Resampling results across tuning parameters:
##
## degree RMSE Rsquared MAE
## 1 0.8336576 0.1656749 0.6511575
## 2 0.8180619 0.1966257 0.6351473
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was degree = 2.
Then, we plot out the tuning process of the MARS model.
ggplot(mars_mod)
Then, we make our prediction on the test set. As the result below shows, the test-set RMSE is about 15.81.
mars_test <- predict(mars_mod, newdata = air_test)
mars_result <- postResample(pred = exp(mars_test), obs = air_test$visitors)
print(mars_result)
## RMSE Rsquared MAE
## 15.8096133 0.1992555 10.7415795
After that, we draw out the prediction plot.
mars_final <- data.frame(cbind(actuals = air_test$visitors, predicteds = exp(mars_test)))
ggplot(mars_final, aes(x = actuals, y = predicteds))+
geom_point()+
geom_smooth(method = 'lm', se = FALSE) +
geom_abline(color = 'red')
Finally, we plot variable importance to see which attributes contribute most to our learner.
mars_imp <- varImp(mars_mod)
ggplot(mars_imp, top=20) + xlab("")
As the variable importance plot shows, features such as the number of reserved visitors, whether the day is a Saturday, and location contribute most to our MARS learner. This is a reasonable finding: our predictions are driven largely by attributes such as reservation counts and location.
In this part, we compare three models and choose our final model.
r_rmse_test <- c(gbm_result[1], nn_result[1],mars_result[1])
r_rsq_test <- c(gbm_result[2], nn_result[2],mars_result[2])
title <- c("Gradient Boosted Trees", "Neural Network", "MARS")
r_table <- data.frame("Model" = title, "RMSE_testing"=r_rmse_test, "RSquared_testing" = r_rsq_test)
r_table
p1 <- ggplot(r_table, aes(x = reorder(Model,-RSquared_testing), y = RSquared_testing))+
geom_bar(stat = 'identity',aes(fill = Model))+
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
guides(fill = FALSE)+
scale_fill_brewer(palette = "Set1")+
labs(x = "Model",
y = "RSquared",
title = "RSquared in testing set")
p2<-ggplot(r_table, aes (x = reorder(Model,RMSE_testing), y = RMSE_testing))+
geom_bar(stat = 'identity',aes(fill = Model))+
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
guides(fill = FALSE)+
scale_fill_brewer(palette = "Set1")+
labs(x = "Model",
y = "RMSE",
title = "RMSE in testing set")
grid.arrange(p1,p2,ncol = 2)
As we can see, the GBM model has the highest R-squared and the lowest RMSE, which suggests it is our best model. To validate this choice, let's use posterior analysis to compare their performance.
rs <- resamples(
list(gbm = gbm_mod, nn = nn_mod, mars = mars_mod)
)
rmse_mod <- perf_mod(rs, seed = 5391, iter = 5000, metric = "RMSE")
posteriors <- tidy(rmse_mod, seed = 973254)
summary(posteriors)
diff <- contrast_models(
rmse_mod,
#list_1 = c('gbm','mars'),
list_1 = c('nn','gbm'),
list_2 = c('mars','nn'),
seed = 2359
)
summary(diff, size = 0.025)
diff %>%
mutate(contrast = paste(model_2, "vs", model_1)) %>%
ggplot(aes(x = difference, col = contrast)) +
geom_line(stat = "density") +
geom_vline(xintercept = c(-0.025, 0.025), lty = 2)
As the result shows, the MARS and neural network models perform quite similarly on the resamples, while the GBM performs slightly better than both. Therefore, we choose the GBM model as our final model.
As analyzed above, we choose our GBM learner as the final model. First, let's take another quick look at how it performs.
gbm_mod$finalModel
## A gradient boosted model with gaussian loss function.
## 100 iterations were performed.
## There were 69 predictors of which 69 had non-zero influence.
gbm_test <- predict(gbm_mod, newdata = air_test)
gbm_result <- postResample(pred = exp(gbm_test), obs = air_test$visitors)
print(gbm_result)
## RMSE Rsquared MAE
## 15.2425176 0.2581045 10.2391292
gbm_final <- data.frame(cbind(actuals = air_test$visitors, predicteds = exp(gbm_test)))
ggplot(gbm_final, aes(x = actuals, y = predicteds))+
geom_point()+
geom_smooth(method = 'lm', se = FALSE)+
geom_abline(color = 'red')
The conclusion we draw from our project is that our gradient boosted trees model outperforms the other two models, the neural network and MARS, in terms of both RMSE and R-squared. Since our prediction target is numeric, we use RMSE as the loss function; the RMSE of our final prediction is about 15.24. Considering the limited information available, this value is acceptable.
To conclude, this was a valuable project for all of us, since it required substantial feature engineering, data cleaning, and data transformation before any training could happen. There are also points that could be improved given more time. For instance, we limited the tuning ranges of our parameters because the models took a long time to run in RStudio. Additionally, other models such as SVMs and bagging methods could be considered.
After choosing this final model, a business application we can build on it is predicting the number of visitors for candidate restaurants. By doing so, we can cooperate with candidate restaurants to provide several types of services based on our customer prediction model.
1. Improve resource allocation
Predicting the number of customers for the following days would allow restaurants to optimize resource allocation, for instance by rescheduling opening hours, and to reduce spending on items such as staffing.
For instance, based on our model, a restaurant could better estimate how many employees it needs in the store on a typical weekend. Additionally, we can draft customized operation strategies for our customers: if we predict more visitors than usual, we can suggest the restaurant hire part-time employees and order more ingredients.
2. Design marketing strategy
Another application of our model is to inform marketing strategy: restaurants can combine their marketing plans with our prediction results. For most restaurants, we predict fewer customers during weekdays. Knowing that, we can suggest restaurants offer coupons valid only on weekdays to attract more visitors and reduce wasted capacity on those days.
3. Predict further revenue
Based on the predicted visitors for each day, month, or year, a restaurant can easily make a further prediction of its future revenue. This can be especially valuable for multiunit restaurants: if our model predicts fewer customers for a specific store in the future, the company might consider improving its service there to generate more revenue.